Results 1 - 20 of 42
1.
J Physiol ; 601(21): 4767-4806, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37786382

ABSTRACT

Comprehensive and accurate analysis of respiratory and metabolic data is crucial to modelling congenital, pathogenic and degenerative diseases converging on autonomic control failure. A lack of tools for high-throughput analysis of respiratory datasets remains a major challenge. We present Breathe Easy, a novel open-source pipeline for processing raw recordings and associated metadata into operative outcomes, publication-worthy graphs and robust statistical analyses including QQ and residual plots for assumption queries and data transformations. This pipeline uses a facile graphical user interface for uploading data files, setting waveform feature thresholds and defining experimental variables. Breathe Easy was validated against manual selection by experts, which represents the current standard in the field. We demonstrate Breathe Easy's utility by examining a 2-year longitudinal study of an Alzheimer's disease mouse model to assess contributions of forebrain pathology in disordered breathing. Whole body plethysmography has become an important experimental outcome measure for a variety of diseases with primary and secondary respiratory indications. Respiratory dysfunction, while not an initial symptom in many of these disorders, often drives disability or death in patient outcomes. Breathe Easy provides an open-source respiratory analysis tool for all respiratory datasets and represents a necessary improvement upon current analytical methods in the field. KEY POINTS: Respiratory dysfunction is a common endpoint for disability and mortality in many disorders throughout life. Whole body plethysmography in rodents represents a high face-value method for measuring respiratory outcomes in rodent models of these diseases and disorders. Analysis of key respiratory variables remains hindered by manual annotation and analysis that leads to low throughput results that often exclude a majority of the recorded data. 
Here we present a software suite, Breathe Easy, that automates the process of data selection from raw recordings derived from plethysmography experiments and the analysis of these data into operative outcomes and publication-worthy graphs with statistics. We validate Breathe Easy with a terabyte-scale Alzheimer's dataset that examines the effects of forebrain pathology on respiratory function over 2 years of degeneration.
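To make the waveform-processing step concrete, here is a hypothetical sketch of the kind of feature extraction such a pipeline performs on a plethysmography trace; the function and variable names are illustrative only and are not Breathe Easy's actual API.

```python
# Hypothetical sketch of plethysmography waveform-feature extraction;
# names and the simple threshold-crossing breath detector are illustrative,
# not Breathe Easy's actual implementation.
import math

def breath_features(signal, sample_rate_hz, threshold=0.0):
    """Segment a flow trace into breaths at upward threshold crossings and
    summarize breathing frequency and mean peak amplitude."""
    crossings = [i for i in range(1, len(signal))
                 if signal[i - 1] < threshold <= signal[i]]
    if len(crossings) < 2:
        return {"breaths": 0, "freq_hz": 0.0, "mean_peak": 0.0}
    # One breath spans consecutive upward crossings; its peak is the max in between
    peaks = [max(signal[a:b]) for a, b in zip(crossings, crossings[1:])]
    duration_s = (crossings[-1] - crossings[0]) / sample_rate_hz
    return {"breaths": len(crossings) - 1,
            "freq_hz": (len(crossings) - 1) / duration_s,
            "mean_peak": sum(peaks) / len(peaks)}

# Synthetic 2 Hz "breathing" waveform sampled at 100 Hz
sig = [math.sin(2 * math.pi * 2 * t / 100) for t in range(300)]
feats = breath_features(sig, 100)
```

On this synthetic trace the detector recovers the 2 Hz breathing rate; a real pipeline would add artifact rejection and apnea handling on top of such primitives.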


Subjects
Respiration, Software, Animals, Mice, Humans, Longitudinal Studies, Plethysmography
2.
Bioinformatics ; 39(10), 2023 Oct 03.
Article in English | MEDLINE | ID: mdl-37792497

ABSTRACT

MOTIVATION: Nuclear magnetic resonance spectroscopy (NMR) is widely used to analyze metabolites in biological samples, but the analysis requires specific expertise, is time-consuming, and can be inaccurate. Here, we present a powerful automated tool, SPatial clustering Algorithm-Statistical TOtal Correlation SpectroscopY (SPA-STOCSY), which overcomes challenges faced when analyzing NMR data and identifies metabolites in a sample with high accuracy. RESULTS: As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset. It first investigates the covariance pattern among datapoints and then calculates the optimal threshold with which to cluster datapoints belonging to the same structural unit, i.e. the metabolite. Generated clusters are then automatically linked to a metabolite library to identify candidates. To assess SPA-STOCSY's efficiency and accuracy, we applied it to synthesized spectra and spectra acquired on Drosophila melanogaster tissue and human embryonic stem cells. In the synthesized spectra, SPA outperformed Statistical Recoupling of Variables (SRV), an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the biological data, SPA-STOCSY performed comparably to the operator-based Chenomx analysis while avoiding operator bias, and it required <7 min of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in NMR spectra. It may thus accelerate the use of NMR for scientific discoveries, medical diagnostics, and patient-specific decision making. AVAILABILITY AND IMPLEMENTATION: The code for SPA-STOCSY is available at https://github.com/LiuzLab/SPA-STOCSY.
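A toy sketch of the underlying STOCSY idea: spectral points arising from the same metabolite co-vary across samples, so across-sample correlation between adjacent points reveals structural units. The fixed threshold below is a stand-in for SPA-STOCSY's data-driven optimal cutoff, and all names are illustrative.

```python
# Toy sketch of STOCSY-style clustering: adjacent spectral points that
# correlate strongly across samples are grouped into one structural unit.
# The fixed threshold stands in for SPA's data-driven optimal cutoff.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cluster_spectrum(spectra, threshold=0.9):
    """Group adjacent spectral points (columns) whose across-sample
    (row-wise) correlation exceeds the threshold."""
    n_points = len(spectra[0])
    cols = [[row[j] for row in spectra] for j in range(n_points)]
    clusters, current = [], [0]
    for j in range(1, n_points):
        if pearson(cols[j - 1], cols[j]) >= threshold:
            current.append(j)
        else:
            clusters.append(current)
            current = [j]
    clusters.append(current)
    return clusters

# Four samples; points 0-2 scale with metabolite A, points 3-4 with metabolite B
conc_a, conc_b = [1, 2, 3, 4], [4, 1, 3, 2]
spectra = [[a, 2 * a, 3 * a, b, 2 * b] for a, b in zip(conc_a, conc_b)]
clusters = cluster_spectrum(spectra)
```

The two metabolites' points fall into separate clusters; the real method additionally chooses the threshold from the data and matches clusters against a compound library.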


Subjects
Drosophila melanogaster, Magnetic Resonance Imaging, Animals, Humans, Magnetic Resonance Spectroscopy/methods, Cluster Analysis, Metabolomics/methods
3.
PLoS Genet ; 19(5): e1010760, 2023 May.
Article in English | MEDLINE | ID: mdl-37200393

ABSTRACT

Heterozygous variants in the glucocerebrosidase (GBA) gene are common and potent risk factors for Parkinson's disease (PD). GBA also causes the autosomal recessive lysosomal storage disorder (LSD), Gaucher disease, and emerging evidence from human genetics implicates many other LSD genes in PD susceptibility. We have systematically tested 86 conserved fly homologs of 37 human LSD genes for requirements in the aging adult Drosophila brain and for potential genetic interactions with neurodegeneration caused by α-synuclein (αSyn), which forms Lewy body pathology in PD. Our screen identifies 15 genetic enhancers of αSyn-induced progressive locomotor dysfunction, including knockdown of fly homologs of GBA and other LSD genes with independent support as PD susceptibility factors from human genetics (SCARB2, SMPD1, CTSD, GNPTAB, SLC17A5). For several genes, results from multiple alleles suggest dose-sensitivity and context-dependent pleiotropy in the presence or absence of αSyn. Homologs of two genes causing cholesterol storage disorders, Npc1a / NPC1 and Lip4 / LIPA, were independently confirmed as loss-of-function enhancers of αSyn-induced retinal degeneration. The enzymes encoded by several modifier genes are upregulated in αSyn transgenic flies, based on unbiased proteomics, revealing a possible, albeit ineffective, compensatory response. Overall, our results reinforce the important role of lysosomal genes in brain health and PD pathogenesis, and implicate several metabolic pathways, including cholesterol homeostasis, in αSyn-mediated neurotoxicity.


Subjects
Parkinson Disease, alpha-Synuclein, Animals, Humans, alpha-Synuclein/genetics, alpha-Synuclein/metabolism, Animals, Genetically Modified, Drosophila/genetics, Drosophila/metabolism, Glucosylceramidase/genetics, Glucosylceramidase/metabolism, Lysosomes/metabolism, Parkinson Disease/pathology, Transferases (Other Substituted Phosphate Groups)/metabolism, Aging/metabolism
4.
Biometrics ; 79(4): 3846-3858, 2023 Dec.
Article in English | MEDLINE | ID: mdl-36950906

ABSTRACT

Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.
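The convex fusion penalty at the heart of this family of methods can be made concrete with a 1-D sketch. The objective below is the standard convex-clustering form that SCC builds on (SCC additionally ties centroids to the supervising auxiliary variable); the data and uniform weights are illustrative.

```python
# Minimal 1-D sketch of the convex-clustering objective SCC builds on:
#   (1/2) * sum_i (x_i - u_i)^2  +  lam * sum_{i<j} w_ij * |u_i - u_j|
# The supervised extension adds a loss linking centroids to auxiliary
# labels; weights, data, and lambda here are illustrative.
def cvx_cluster_objective(X, U, lam, w):
    fit = 0.5 * sum((x - u) ** 2 for x, u in zip(X, U))
    fuse = lam * sum(w[i][j] * abs(U[i] - U[j])
                     for i in range(len(U)) for j in range(i + 1, len(U)))
    return fit + fuse

X = [0.0, 0.1, 5.0, 5.2]          # two obvious 1-D groups
w = [[1] * 4 for _ in range(4)]   # uniform fusion weights
separate = cvx_cluster_objective(X, X, lam=0.0, w=w)           # no penalty
fused = cvx_cluster_objective(X, [0.05, 0.05, 5.1, 5.1], 0.1, w)
```

With a positive lambda, fusing each group's centroids lowers the objective relative to leaving every centroid at its own observation, which is how the penalty discovers groups.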


Subjects
Alzheimer Disease, Humans, Aged, Alzheimer Disease/genetics, Genomics, Cluster Analysis
5.
bioRxiv ; 2023 Feb 22.
Article in English | MEDLINE | ID: mdl-36865102

ABSTRACT

Nuclear Magnetic Resonance (NMR) spectroscopy is widely used to analyze metabolites in biological samples, but the analysis can be cumbersome and inaccurate. Here, we present a powerful automated tool, SPA-STOCSY (Spatial Clustering Algorithm - Statistical Total Correlation Spectroscopy), which overcomes the challenges by identifying metabolites in each sample with high accuracy. As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset, first investigating the covariance pattern and then calculating the optimal threshold with which to cluster data points belonging to the same structural unit, i.e. metabolite. The generated clusters are then automatically linked to a compound library to identify candidates. To assess SPA-STOCSY’s efficiency and accuracy, we applied it to synthesized and real NMR data obtained from Drosophila melanogaster brains and human embryonic stem cells. In the synthesized spectra, SPA outperforms Statistical Recoupling of Variables, an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the real spectra, SPA-STOCSY performs comparably to operator-based Chenomx analysis but avoids operator bias and performs the analyses in less than seven minutes of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in the NMR spectra. As such, it might accelerate the utilization of NMR for scientific discoveries, medical diagnostics, and patient-specific decision making.

6.
PLoS Comput Biol ; 18(10): e1010577, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36191044

ABSTRACT

Consensus clustering has been widely used in bioinformatics and other applications to improve the accuracy, stability and reliability of clustering results. This approach ensembles cluster co-occurrences from multiple clustering runs on subsampled observations. For application to large-scale bioinformatics data, such as discovering cell types from single-cell sequencing data, consensus clustering has two significant drawbacks: (i) computational inefficiency due to repeatedly applying clustering algorithms, and (ii) lack of interpretability into the important features for differentiating clusters. In this paper, we address these two challenges by developing IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering. Our approach adopts three major innovations. We ensemble cluster co-occurrences from tiny subsets of both observations and features, termed minipatches, thus dramatically reducing computation time. Additionally, we develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings, as well as adaptive sampling schemes for features, which lead to interpretable solutions by quickly learning the most relevant features that differentiate clusters. We study our approach on synthetic data and a variety of real large-scale bioinformatics data sets; results show that our approach not only yields more accurate and interpretable cluster solutions, but also substantially improves computational efficiency compared to standard consensus clustering approaches.
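The minipatch consensus idea can be sketched in a few lines: cluster many tiny random observation-by-feature subsets and average co-cluster indicators. The trivial mean-split base clusterer and fixed patch sizes below are placeholders for IMPACC's base method and adaptive sampling schemes.

```python
# Illustrative minipatch consensus sketch: cluster many tiny random
# observation x feature subsets and average the co-cluster indicators.
# The mean-split base clusterer and fixed sizes are placeholders for
# IMPACC's base method and adaptive sampling schemes.
import random

def minipatch_consensus(X, n_patches=200, n_obs=4, n_feat=1, seed=0):
    rng = random.Random(seed)
    n = len(X)
    co = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_patches):
        obs = rng.sample(range(n), n_obs)
        feats = rng.sample(range(len(X[0])), n_feat)
        # toy base clusterer: split sampled observations at their mean value
        vals = {i: sum(X[i][f] for f in feats) for i in obs}
        cut = sum(vals.values()) / len(vals)
        label = {i: vals[i] >= cut for i in obs}
        for i in obs:
            for j in obs:
                counts[i][j] += 1
                co[i][j] += label[i] == label[j]
    # co-clustering rate per pair, over the patches where both appeared
    return [[co[i][j] / counts[i][j] if counts[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Two well-separated groups in 2 features
X = [[0, 0], [0.2, 0.1], [0.1, 0.2], [5, 5], [5.1, 4.9], [4.9, 5.2]]
C = minipatch_consensus(X)
```

Pairs within a group accumulate co-clustering rates near 1 and cross-group pairs near 0, so thresholding `C` recovers the groups at a fraction of full consensus clustering's cost.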


Subjects
Algorithms, Computational Biology, Cluster Analysis, Computational Biology/methods, Consensus, Reproducibility of Results
7.
J Comput Biol ; 29(5): 465-482, 2022 May.
Article in English | MEDLINE | ID: mdl-35325552

ABSTRACT

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have yielded a powerful tool to measure gene expression of individual cells. One major challenge with scRNA-seq data is that it usually contains a large number of zero expression values, which often impairs the effectiveness of downstream analyses. Numerous data imputation methods have been proposed to deal with these "dropout" events, but this is a difficult task for such high-dimensional and sparse data. Furthermore, there has been debate about the nature of this sparsity: whether the zeros are due to technological limitations or represent actual biology. To address these challenges, we propose Single-cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information (SCENA), a novel approach that imputes the correlation matrix of the data of interest instead of the data itself. SCENA obtains a gene-by-gene correlation estimate by ensembling various individual estimates, some of which are based on known auxiliary information about gene expression networks. Our approach is a reliable method that makes no assumptions on the nature of sparsity in scRNA-seq data or the data distribution. Through extensive simulation studies and real data applications, we demonstrate that SCENA is not only superior in gene correlation estimation, but also improves the accuracy and reliability of downstream analyses, including cell clustering, dimension reduction, and graphical model estimation to learn the gene expression network.
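The core ensembling step can be sketched as a weighted average of candidate correlation matrices. The equal weights and tiny matrices below are placeholders for SCENA's learned stacking weights and real estimates.

```python
# Sketch of the ensembling idea behind SCENA: combine several gene-gene
# correlation estimates (e.g., one from the sparse counts, one from an
# auxiliary network prior) into a single completed matrix. Equal weights
# are a placeholder for SCENA's learned stacking weights.
def ensemble_correlation(estimates, weights=None):
    k = len(estimates)
    weights = weights or [1.0 / k] * k
    n = len(estimates[0])
    out = [[sum(w * m[i][j] for w, m in zip(weights, estimates))
            for j in range(n)] for i in range(n)]
    for i in range(n):
        out[i][i] = 1.0   # a correlation matrix has a unit diagonal
    return out

raw = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.2], [0.0, 0.2, 1.0]]    # dropout-deflated
prior = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.6], [0.0, 0.6, 1.0]]  # network prior
fused = ensemble_correlation([raw, prior])
```

Entries supported by the auxiliary prior are pulled upward from their dropout-deflated values, which is the sense in which the correlation matrix, rather than the data, gets "completed".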


Subjects
Gene Expression Profiling, Single-Cell Analysis, Cluster Analysis, Computer Simulation, RNA-Seq, Reproducibility of Results, Sequence Analysis, RNA/methods, Single-Cell Analysis/methods
8.
BMC Biol ; 20(1): 28, 2022 Jan 28.
Article in English | MEDLINE | ID: mdl-35086530

ABSTRACT

BACKGROUND: The functional understanding of genetic interaction networks and cellular mechanisms governing health and disease requires the dissection, and multifaceted study, of discrete cell subtypes in developing and adult animal models. Recombinase-driven expression of transgenic effector alleles represents a significant and powerful approach to delineate cell populations for functional, molecular, and anatomical studies. In addition to single recombinase systems, the expression of two recombinases in distinct, but partially overlapping, populations allows for more defined target expression. Although the application of this method is becoming increasingly popular, its experimental implementation has been broadly restricted to manipulations of a limited set of common alleles that are often commercially produced at great expense; the costs and technical challenges associated with producing intersectional mouse lines have put customized approaches out of reach for many researchers. Here, we present a simplified CRISPR toolkit for rapid, inexpensive, and facile intersectional allele production. RESULTS: Briefly, we produced 7 intersectional mouse lines using a dual recombinase system, one mouse line with a single recombinase system, and three embryonic stem (ES) cell lines that are designed to study the way functional, molecular, and anatomical features relate to each other in building circuits that underlie physiology and behavior. As a proof-of-principle, we applied three of these lines to different neuronal populations for anatomical mapping and functional in vivo investigation of respiratory control. We also generated a mouse line with a single recombinase-responsive allele that controls the expression of the calcium sensor Twitch-2B. This mouse line was applied globally to study the effects of follicle-stimulating hormone (FSH) and luteinizing hormone (LH) on calcium release in the ovarian follicle.
CONCLUSIONS: The lines presented here are representative examples of outcomes possible with the successful application of our genetic toolkit for the facile development of diverse, modifiable animal models. This toolkit will allow labs to create single or dual recombinase effector lines easily for any cell population or subpopulation of interest when paired with the appropriate Cre and FLP recombinase mouse lines or viral vectors. We have made our tools and derivative intersectional mouse and ES cell lines openly available for non-commercial use through publicly curated repositories for plasmid DNA, ES cells, and transgenic mouse lines.


Subjects
Calcium, Clustered Regularly Interspaced Short Palindromic Repeats, Animals, Female, Integrases/genetics, Integrases/metabolism, Mice, Mice, Transgenic, Neurons/physiology, Recombinases/genetics, Recombinases/metabolism
9.
J Mach Learn Res ; 22, 2021 Jan.
Article in English | MEDLINE | ID: mdl-34744522

ABSTRACT

In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
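A drastically simplified 1-D sketch of the multi-view objective: each view contributes its own convex loss on shared centroids, plus a joint fusion penalty. Sharing one centroid vector across views and the particular losses and data below are illustrative simplifications, not iGecco's full formulation.

```python
# Drastically simplified 1-D sketch of an iGecco-style objective: each
# data view gets its own convex loss, and a joint fusion penalty on the
# centroids ties observations into common groups. Losses, toy data, and
# the shared-centroid simplification are illustrative.
def igecco_objective(views, losses, U, lam):
    fit = sum(sum(loss(x, u) for x, u in zip(view, U))
              for view, loss in zip(views, losses))
    fuse = lam * sum(abs(U[i] - U[j])
                     for i in range(len(U)) for j in range(i + 1, len(U)))
    return fit + fuse

sq = lambda x, u: 0.5 * (x - u) ** 2      # Gaussian-type view
ab = lambda x, u: abs(x - u)              # robust / heavy-tailed view
views = [[0.0, 0.1, 5.0], [0.2, 0.0, 4.8]]  # two views of the same 3 samples
obj = igecco_objective(views, [sq, ab], U=[0.05, 0.05, 4.9], lam=0.2)
```

Matching the loss to each view's distribution (squared error, absolute error, divergences) while sharing the fusion penalty is the sense in which the views are integrated.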

10.
Article in English | MEDLINE | ID: mdl-34734115

ABSTRACT

Boosting methods are among the best general-purpose and off-the-shelf machine learning approaches and have gained widespread popularity. In this paper, we seek to develop a boosting method that yields comparable accuracy to popular AdaBoost and gradient boosting methods, yet is faster computationally and whose solution is more interpretable. We achieve this by developing MP-Boost, an algorithm loosely based on AdaBoost that learns by adaptively selecting small subsets of instances and features, or what we term minipatches (MP), at each iteration. By sequentially learning on tiny subsets of the data, our approach is computationally faster than other classic boosting algorithms. Also as it progresses, MP-Boost adaptively learns a probability distribution on the features and instances that upweight the most important features and challenging instances, hence adaptively selecting the most relevant minipatches for learning. These learned probability distributions also aid in interpretation of our method. We empirically demonstrate the interpretability, comparative accuracy, and computational time of our approach on a variety of binary classification tasks.
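A loose sketch of the minipatch-boosting loop: each round fits a weak learner (a decision stump here) on a tiny weighted sample of instances and a random feature, then upweights current mistakes. The sampling scheme and the fixed upweighting factor are crude placeholders for MP-Boost's adaptive distributions.

```python
# Loose sketch of minipatch boosting: each round, fit a decision stump on
# a tiny weighted sample of instances and one random feature, then
# upweight misclassified instances. The fixed 1.5x upweighting and the
# sampling scheme are placeholders for MP-Boost's adaptive versions.
import random

def stump(X, y, feat):
    """Best threshold/sign classifier on one feature for +/-1 labels."""
    best = None
    for t in sorted({x[feat] for x in X}):
        for sign in (1, -1):
            err = sum(1 for x, yy in zip(X, y)
                      if (sign if x[feat] >= t else -sign) != yy)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: sign if x[feat] >= t else -sign

def mp_boost(X, y, rounds=30, n_obs=6, seed=1):
    rng = random.Random(seed)
    w = [1.0] * len(X)
    learners = []
    for _ in range(rounds):
        idx = rng.choices(range(len(X)), weights=w, k=n_obs)  # instance minipatch
        feat = rng.randrange(len(X[0]))                       # feature minipatch
        h = stump([X[i] for i in idx], [y[i] for i in idx], feat)
        learners.append(h)
        for i in range(len(X)):          # upweight current mistakes
            if h(X[i]) != y[i]:
                w[i] *= 1.5
    return lambda x: 1 if sum(h(x) for h in learners) >= 0 else -1

X = [[0, 0], [1, 0], [0, 1], [1, 1], [3, 3], [4, 3], [3, 4], [4, 4]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]
clf = mp_boost(X, y)
```

Because each stump sees only a handful of instances and one feature, rounds are cheap, and the evolving instance weights play the role of MP-Boost's learned sampling distribution.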

11.
Article in English | MEDLINE | ID: mdl-34746376

ABSTRACT

Ridge-like regularization often leads to improved generalization performance of machine learning models by mitigating overfitting. While ridge-regularized machine learning methods are widely used in many important applications, direct training via optimization could become challenging in huge data scenarios with millions of examples and features. We tackle such challenges by proposing a general approach that achieves ridge-like regularization through implicit techniques named Minipatch Ridge (MPRidge). Our approach is based on taking an ensemble of coefficients of unregularized learners trained on many tiny, random subsamples of both the examples and features of the training data, which we call minipatches. We empirically demonstrate that MPRidge induces an implicit ridge-like regularizing effect and performs nearly the same as explicit ridge regularization for a general class of predictors including logistic regression, SVM, and robust regression. Embarrassingly parallelizable, MPRidge provides a computationally appealing alternative to inducing ridge-like regularization for improving generalization performance in challenging big-data settings.
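A toy illustration of the minipatch-ensembling idea: average the coefficients of unregularized fits on tiny random subsets of examples and features. Fitting one feature per patch with no intercept keeps the sketch stdlib-only; the real method ensembles general unregularized learners, so treat every name and constant here as an assumption.

```python
# Toy sketch of minipatch coefficient ensembling: unregularized 1-feature
# least-squares fits on tiny random subsets, averaged over all rounds.
# Features absent from a round contribute zero, which shrinks the average
# ridge-like. A crude stand-in for MPRidge, not its actual algorithm.
import random

def minipatch_coefs(X, y, n_patches=500, n_obs=5, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    acc = [0.0] * p
    for _ in range(n_patches):
        f = rng.randrange(p)                               # feature minipatch
        idx = [rng.randrange(len(X)) for _ in range(n_obs)]  # example minipatch
        num = sum(X[i][f] * y[i] for i in idx)
        den = sum(X[i][f] ** 2 for i in idx)
        if den > 0:
            acc[f] += num / den     # 1-D least-squares slope, no intercept
    return [a / n_patches for a in acc]   # zeros from unsampled rounds shrink

data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(40)]
y = [2 * row[0] for row in X]    # y depends only on feature 0
coef = minipatch_coefs(X, y)
```

Feature 0's true slope of 2 is recovered only in the rounds that sample it, so its averaged coefficient lands near 1: the ensemble shrinks coefficients toward zero without ever solving a penalized problem, which is the "implicit ridge" effect.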

12.
Article in English | MEDLINE | ID: mdl-34278383

ABSTRACT

Central venous pressure (CVP) is the blood pressure in the venae cavae, near the right atrium of the heart. This signal waveform is commonly collected in clinical settings, and yet there has been limited discussion of using this data for detecting arrhythmia and other cardiac events. In this paper, we develop a signal processing and feature engineering pipeline for CVP waveform analysis. Through a case study on pediatric junctional ectopic tachycardia (JET), we show that our extracted CVP features reliably detect JET with comparable results to the more commonly used electrocardiogram (ECG) features. This machine learning pipeline can thus improve the clinical diagnosis and ICU monitoring of arrhythmia. It also corroborates and complements the ECG-based diagnosis, especially when the ECG measurements are unavailable or corrupted.
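An illustrative sketch of the kind of feature engineering such a pipeline performs: detect waveform peaks, then summarize beat-to-beat intervals, whose irregularity is the sort of feature an arrhythmia detector might use. The local-maximum detector and thresholds are ad hoc assumptions, not the paper's pipeline.

```python
# Illustrative CVP-style feature engineering: find waveform peaks, then
# summarize beat-to-beat intervals. Detector and thresholds are ad hoc,
# not the paper's actual pipeline.
def peak_indices(sig, min_height):
    return [i for i in range(1, len(sig) - 1)
            if sig[i] > min_height and sig[i - 1] < sig[i] >= sig[i + 1]]

def interval_features(sig, fs_hz, min_height=0.5):
    peaks = peak_indices(sig, min_height)
    ivals = [(b - a) / fs_hz for a, b in zip(peaks, peaks[1:])]
    mean_iv = sum(ivals) / len(ivals)
    var = sum((v - mean_iv) ** 2 for v in ivals) / len(ivals)
    # regular rhythms give a small interval SD; arrhythmias inflate it
    return {"rate_hz": 1 / mean_iv, "interval_sd": var ** 0.5}

# Synthetic, perfectly regular 2 Hz pulse train sampled at 100 Hz
sig = [1.0 if i % 50 == 25 else 0.0 for i in range(300)]
f = interval_features(sig, 100)
```

On real CVP traces one would filter noise and respiratory modulation first; features like the interval SD then feed a classifier alongside ECG-derived features.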

13.
Elife ; 10, 2021 Apr 19.
Article in English | MEDLINE | ID: mdl-33871358

ABSTRACT

Most research on neurodegenerative diseases has focused on neurons, yet glia help form and maintain the synapses whose loss is so prominent in these conditions. To investigate the contributions of glia to Huntington's disease (HD), we profiled the gene expression alterations of Drosophila expressing human mutant Huntingtin (mHTT) in either glia or neurons and compared these changes to what is observed in HD human and HD mice striata. A large portion of conserved genes are concordantly dysregulated across the three species; we tested these genes in a high-throughput behavioral assay and found that downregulation of genes involved in synapse assembly mitigated pathogenesis and behavioral deficits. To our surprise, reducing dNRXN3 function in glia was sufficient to improve the phenotype of flies expressing mHTT in neurons, suggesting that mHTT's toxic effects in glia ramify throughout the brain. This supports a model in which dampening synaptic function is protective because it attenuates the excitotoxicity that characterizes HD.


When a neuron dies, through injury or disease, the body loses all communication that passes through it. The brain compensates by rerouting the flow of information through other neurons in the network. Eventually, if the loss of neurons becomes too great, compensation becomes impossible. This process happens in Alzheimer's, Parkinson's, and Huntington's disease. In the case of Huntington's disease, the cause is a mutation in a single gene known as huntingtin. The mutation is present in every cell in the body but causes particular damage to parts of the brain involved in mood, thinking, and movement. Neurons and other cells respond to mutations in the huntingtin gene by turning the activities of other genes up or down, but it is not clear whether all of these changes contribute to the damage seen in Huntington's disease. In fact, it is possible that some of the changes are a result of the brain trying to protect itself. So far, most research on this subject has focused on neurons because the huntingtin gene plays a role in maintaining healthy neuronal connections. But, given that all cells carry the mutated gene, it is likely that other cells are also involved. The glia are a diverse group of cells that support the brain, providing care and sustenance to neurons. These cells have a known role in maintaining the connections between neurons and may also play a role in either causing or correcting the damage seen in Huntington's disease. The aim of Onur et al. was to find out which genes are affected by having a mutant huntingtin gene in neurons or glia, and whether the severity of Huntington's disease improved or worsened when the activity of these genes changed. First, Onur et al. identified genes affected by mutant huntingtin by comparing healthy human brains to the brains of people with Huntington's disease.
Repeating the same comparison in mice and fruit flies identified genes affected in the same way across all three species, revealing that, in Huntington's disease, the brain dials down glial cell genes involved in maintaining neuronal connections. To find out how these changes in gene activity affect disease severity and progression, Onur et al. manipulated the activity of each of the genes they had identified in fruit flies that carried mutant versions of huntingtin either in neurons, in glial cells or in both cell types. They then filmed the flies to see the effects of the manipulation on movement behaviors, which are affected by Huntington's disease. This revealed that purposely lowering the activity of the glial genes involved in maintaining connections between neurons improved the symptoms of the disease, but only in flies who had mutant huntingtin in their glial cells. This indicates that the drop in activity of these genes observed in Huntington's disease is the brain trying to protect itself. This work suggests that it is important to include glial cells in studies of neurological disorders. It also highlights the fact that changes in gene expression as a result of a disease are not always bad. Many alterations are compensatory, and try to either make up for or protect cells affected by the disease. Therefore, it may be important to consider whether drugs designed to treat a condition by changing levels of gene activity might undo some of the body's natural protection. Working out which changes drive disease and which changes are protective will be essential for designing effective treatments.


Subjects
Brain/metabolism, Drosophila Proteins/metabolism, Electrical Synapses/metabolism, Huntingtin Protein/metabolism, Huntington Disease/metabolism, Neuroglia/metabolism, Synaptic Transmission, Animals, Behavior, Animal, Brain/pathology, Brain/physiopathology, Case-Control Studies, Cell Adhesion Molecules, Neuronal/genetics, Cell Adhesion Molecules, Neuronal/metabolism, Cell Line, Disease Models, Animal, Drosophila Proteins/genetics, Drosophila melanogaster, Electrical Synapses/pathology, Female, Gene Regulatory Networks, Humans, Huntingtin Protein/genetics, Huntington Disease/genetics, Huntington Disease/pathology, Huntington Disease/physiopathology, Locomotion, Male, Mice, Transgenic, Mutation, Neuroglia/pathology, Transcriptome, alpha 1-Antitrypsin/genetics, alpha 1-Antitrypsin/metabolism
14.
Hepatology ; 73(6): 2278-2292, 2021 Jun.
Article in English | MEDLINE | ID: mdl-32931023

ABSTRACT

BACKGROUND AND AIMS: Therapeutic, clinical trial entry and stratification decisions for hepatocellular carcinoma (HCC) are made based on prognostic assessments, using clinical staging systems based on small numbers of empirically selected variables that insufficiently account for differences in biological characteristics of individual patients' disease. APPROACH AND RESULTS: We propose an approach for constructing risk scores from circulating biomarkers that produce a global biological characterization of an individual patient's disease. Plasma samples were collected prospectively from 767 patients with HCC and 200 controls, and 317 proteins were quantified in a Clinical Laboratory Improvement Amendments-certified biomarker testing laboratory. We constructed a circulating biomarker aberration score for each patient, a score between 0 and 1 that measures the degree of aberration of his or her biomarker panel relative to normal, which we call HepatoScore. We used log-rank tests to assess its ability to substratify patients within existing staging systems/prognostic factors. To enhance clinical application, we constructed a single-sample score, HepatoScore-14, which requires only a subset of 14 representative proteins encompassing the global biological effects. Patients with HCC were split into three distinct groups (low, medium, and high HepatoScore) with vastly different prognoses (median overall survival 38.2/18.3/7.1 months; P < 0.0001). Furthermore, HepatoScore accurately substratified patients within levels of existing prognostic factors and staging systems (P < 0.0001 for nearly all), providing substantial and sometimes dramatic refinement of expected patient outcomes with strong therapeutic implications. These results were recapitulated by HepatoScore-14, rigorously validated in repeated training/test splits, concordant across Myriad RBM (Austin, TX) and enzyme-linked immunosorbent assay kits, and established as an independent prognostic factor.
CONCLUSIONS: HepatoScore-14 augments existing HCC staging systems, dramatically refining patient prognostic assessments and therapeutic decision making and enrollment in clinical trials. The underlying strategy provides a global biological characterization of disease, and can be applied broadly to other disease settings and biological media.
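One plausible way a 0-to-1 "aberration relative to normal" score could be formed is sketched below: z-score each marker against the control distribution and squash the average absolute z through a logistic map. This is only an illustrative construction, not the published HepatoScore algorithm.

```python
# Hedged sketch of a 0-1 biomarker "aberration" score relative to controls:
# z-score each marker against control mean/SD, then squash the mean |z|
# through a logistic map. An illustrative construction only, not the
# published HepatoScore algorithm.
import math

def aberration_score(panel, control_means, control_sds):
    zs = [abs((v - m) / s)
          for v, m, s in zip(panel, control_means, control_sds)]
    mean_z = sum(zs) / len(zs)
    return 2 / (1 + math.exp(-mean_z)) - 1   # 0 at the control mean, -> 1 when extreme

mu, sd = [1.0, 2.0, 0.5], [0.2, 0.5, 0.1]       # hypothetical control stats
healthy = aberration_score([1.0, 2.0, 0.5], mu, sd)   # exactly at control mean
aberrant = aberration_score([2.0, 4.5, 1.5], mu, sd)  # many SDs from normal
```

Any monotone map of panel-wide deviation into [0, 1] would serve the same substratification role; the published score is built and validated far more carefully.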


Subjects
Biomarkers, Tumor/blood, Carcinoma, Hepatocellular/blood, Liver Neoplasms/blood, Severity of Illness Index, Carcinoma, Hepatocellular/pathology, Case-Control Studies, Female, Humans, Liver Neoplasms/pathology, Male, Predictive Value of Tests, Prognosis, Proportional Hazards Models, Risk Factors
15.
PLoS One ; 15(11): e0241707, 2020.
Article in English | MEDLINE | ID: mdl-33152028

ABSTRACT

Even though there is a clear link between Alzheimer's Disease (AD) related neuropathology and cognitive decline, numerous studies have observed that healthy cognition can exist in the presence of extensive AD pathology, a phenomenon sometimes called Cognitive Resilience (CR). To better understand and study CR, we develop the Alzheimer's Disease Cognitive Resilience Score (AD-CR Score), which we define as the difference between the observed and expected cognition given the observed level of AD pathology. Unlike other definitions of CR, our AD-CR Score is a fully non-parametric, stand-alone, individual-level quantification of CR that is derived independently of other factors or proxy variables. Using data from two ongoing, longitudinal cohort studies of aging, the Religious Orders Study (ROS) and the Rush Memory and Aging Project (MAP), we validate our AD-CR Score by showing strong associations with known factors related to CR such as baseline and longitudinal cognition, non AD-related pathology, education, personality, APOE, parkinsonism, depression, and life activities. Even though the proposed AD-CR Score cannot be directly calculated during an individual's lifetime because it uses postmortem pathology, we also develop a machine learning framework that achieves promising results in terms of predicting whether an individual will have an extremely high or low AD-CR Score using only measures available during the lifetime. Given this, our AD-CR Score can be used for further investigations into mechanisms of CR, and potentially for subject stratification prior to clinical trials of personalized therapies.
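The "observed minus expected" residual idea can be sketched directly. Below, expected cognition comes from a k-nearest-neighbor regression on pathology burden; this simple estimator and the toy data are stand-ins for the paper's fully non-parametric fit.

```python
# Sketch of the observed-minus-expected residual behind the AD-CR Score:
# expected cognition is estimated non-parametrically (here, k-NN
# regression on pathology burden); the residual is the resilience score.
# Estimator choice and toy data are illustrative.
def knn_expected(pathology, cognition, x, k=3):
    nearest = sorted(range(len(pathology)),
                     key=lambda i: abs(pathology[i] - x))[:k]
    return sum(cognition[i] for i in nearest) / k

def cr_scores(pathology, cognition, k=3):
    return [c - knn_expected(pathology, cognition, p, k)
            for p, c in zip(pathology, cognition)]

pathology = [0, 1, 2, 3, 4, 5]     # increasing AD pathology burden
cognition = [5, 4, 3, 2, 1, 4]     # last subject: high pathology, preserved cognition
scores = cr_scores(pathology, cognition)
```

The last subject, whose cognition far exceeds what their pathology predicts, receives the largest positive residual, which is exactly the resilience signal the score is meant to capture.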


Subjects
Alzheimer Disease/diagnosis, Alzheimer Disease/physiopathology, Cognitive Dysfunction/diagnosis, Cognitive Dysfunction/physiopathology, Cohort Studies, Humans, Longitudinal Studies
16.
J Comput Graph Stat ; 29(1): 87-96, 2020.
Article in English | MEDLINE | ID: mdl-32982130

ABSTRACT

Convex clustering is a promising new approach to the classical problem of clustering, combining strong performance in empirical studies with rigorous theoretical foundations. Despite these advantages, convex clustering has not been widely adopted, due to its computationally intensive nature and its lack of compelling visualizations. To address these impediments, we introduce Algorithmic Regularization, an innovative technique for obtaining high-quality estimates of regularization paths using an iterative one-step approximation scheme. We justify our approach with a novel theoretical result, guaranteeing global convergence of the approximate path to the exact solution under easily-checked non-data-dependent assumptions. The application of algorithmic regularization to convex clustering yields the Convex Clustering via Algorithmic Regularization Paths (CARP) algorithm for computing the clustering solution path. On example data sets from genomics and text analysis, CARP delivers over a 100-fold speed-up over existing methods, while attaining a finer approximation grid than standard methods. Furthermore, CARP enables improved visualization of clustering solutions: the fine solution grid returned by CARP can be used to construct a convex clustering-based dendrogram, as well as forming the basis of a dynamic path-wise visualization based on modern web technologies. Our methods are implemented in the open-source R package clustRviz, available at https://github.com/DataSlingers/clustRviz.
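The core idea of algorithmic regularization can be caricatured in one dimension. The sketch below is hypothetical, not the CARP implementation in clustRviz: at each grid value of lambda it takes a single gradient step on the 1-D convex clustering objective, fuses centroids that have met, and then grows lambda geometrically; the recorded fusions trace out a dendrogram-like solution path. All names and constants are illustrative.

```python
def carp_sketch(x, eta=0.1, lam0=0.01, t=1.05, tol=1e-2, max_iter=2000):
    """One-step-per-lambda caricature of algorithmic regularization
    for 1-D convex clustering:
        0.5 * sum_i (u_i - x_i)^2 + lam * sum_{a<b} |u_a - u_b|,
    with the fusion sum taken over current cluster centroids.  One
    gradient step per lambda, then fuse, then lam *= t."""
    clusters = [(xi, [i]) for i, xi in enumerate(x)]  # (centroid, members)
    lam, merges = lam0, []
    for _ in range(max_iter):
        if len(clusters) == 1:
            break
        stepped = []
        for a, (u, mem) in enumerate(clusters):
            g = sum(u - x[m] for m in mem)            # fidelity term
            for b, (v, _) in enumerate(clusters):     # fusion term
                if b != a:
                    g += lam if u > v else -lam
            stepped.append((u - eta * g, mem))
        stepped.sort()
        fused = [stepped[0]]
        for u, mem in stepped[1:]:
            pu, pmem = fused[-1]
            if abs(u - pu) < tol:                     # centroids met: fuse
                size = len(pmem) + len(mem)
                fused[-1] = ((pu * len(pmem) + u * len(mem)) / size,
                             pmem + mem)
                merges.append((lam, sorted(pmem + mem)))
            else:
                fused.append((u, mem))
        clusters = fused
        lam *= t
    return merges  # list of (lambda, members) fusion events

merges = carp_sketch([0.0, 0.1, 0.9, 1.0])
```

On this toy input the two nearby pairs fuse first at a small lambda, and the two resulting clusters fuse last, recovering the expected hierarchy without ever solving any single lambda to convergence.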

17.
ACM BCB ; 2020, 2020 Sep.
Article in English | MEDLINE | ID: mdl-34278382

ABSTRACT

Single cell RNA sequencing is a powerful technique that measures the gene expression of individual cells in a high-throughput fashion. However, sequencing inefficiency leads to dropout events: technical artifacts in which genes erroneously appear to have zero expression. Many data imputation methods have been proposed to alleviate this issue. Yet effective imputation is difficult and often biased because the data is sparse and high-dimensional, and imputation errors can cause major distortions in downstream analyses. In this paper, we propose a completely novel approach that imputes the gene-by-gene correlations rather than the data itself. We call this method SCENA: Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information. The SCENA gene-by-gene correlation matrix estimate is obtained by model stacking of multiple imputed correlation matrices based on known auxiliary information about gene connections. In an extensive simulation study based on real scRNA-seq data, we demonstrate that SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.
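The stacking idea, combining several candidate correlation estimates with weights informed by auxiliary gene-network knowledge, can be sketched as follows. This is a hypothetical simplification, not SCENA's actual weighting scheme; the weighting rule and all names are assumptions for illustration.

```python
import numpy as np

def scena_sketch(candidates, aux_adj):
    """Ensemble-stacking sketch: combine candidate gene-by-gene
    correlation estimates, weighting each candidate by how strongly
    it correlates gene pairs known to be connected in an auxiliary
    0/1 gene network.  Not the published SCENA algorithm."""
    weights = []
    for C in candidates:
        # agreement score: mean |correlation| over known-connected pairs
        weights.append(np.abs(C[aux_adj == 1]).mean())
    w = np.array(weights)
    w = w / w.sum()
    stacked = sum(wi * C for wi, C in zip(w, candidates))
    # re-symmetrize, clip to the valid correlation range, fix diagonal
    stacked = np.clip((stacked + stacked.T) / 2, -1.0, 1.0)
    np.fill_diagonal(stacked, 1.0)
    return stacked, w
```

A candidate that agrees with the auxiliary network receives a larger weight, so its entries dominate the stacked estimate.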

18.
Neuroimage ; 197: 330-343, 2019 08 15.
Article in English | MEDLINE | ID: mdl-31029870

ABSTRACT

Advanced brain imaging techniques make it possible to measure individuals' structural connectomes in large cohort studies non-invasively. Given the availability of large-scale data sets, it is important to build a set of advanced tools for structural connectome extraction and statistical analysis that emphasize both interpretability and predictive power. In this paper, we developed and integrated a set of toolboxes, including an advanced structural connectome extraction pipeline and a novel tensor network principal components analysis (TN-PCA) method, to study relationships between structural connectomes and various human traits such as alcohol and drug use, cognition and motor abilities. The structural connectome extraction pipeline produces a set of connectome features for each subject that can be organized as a tensor network, and TN-PCA maps the high-dimensional tensor network data to a lower-dimensional Euclidean space. Combined with classical hypothesis testing, canonical correlation analysis and linear discriminant analysis techniques, we analyzed over 1100 scans of 1076 subjects from the Human Connectome Project (HCP) and the Sherbrooke test-retest data set, as well as 175 human traits measuring different domains including cognition, substance use, motor, sensory and emotion. The test-retest data validated the developed algorithms. With the HCP data, we found that structural connectomes are associated with a wide range of traits, e.g., fluid intelligence, language comprehension, and motor skills are associated with increased cortical-cortical brain structural connectivity, while the use of alcohol, tobacco, and marijuana is associated with decreased cortical-cortical connectivity. We also demonstrated that our extracted structural connectomes and analysis method give superior prediction accuracy compared with alternative connectome constructions and other tensor and network regression methods.
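To make the connectome-to-embedding step concrete, here is a deliberately simplified stand-in for TN-PCA: each subject's symmetric region-by-region connectivity matrix is vectorized (upper triangle only) and embedded with ordinary PCA via the SVD. The real TN-PCA exploits the tensor network structure rather than flattening it; this sketch, with its assumed function name and toy data, only illustrates the mapping from connectomes to a low-dimensional Euclidean space.

```python
import numpy as np

def connectome_embedding(connectomes, k=2):
    """Embed each subject's symmetric region-by-region connectome
    into k dimensions via PCA on the vectorized upper triangle.
    A flattened simplification of TN-PCA, for illustration only."""
    n_regions = connectomes[0].shape[0]
    iu = np.triu_indices(n_regions, k=1)
    X = np.stack([C[iu] for C in connectomes])   # subjects x edges
    X = X - X.mean(axis=0)                       # center each edge
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]                      # PCA score matrix
```

On toy data with a high-connectivity group and a low-connectivity group, the first component separates the groups, which is the kind of trait-related structure the downstream hypothesis tests operate on.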


Subjects
Brain/anatomy & histology , Connectome/methods , Diffusion Tensor Imaging/methods , Image Processing, Computer-Assisted/methods , Personality/physiology , Brain/diagnostic imaging , Data Interpretation, Statistical , Female , Humans , Male , Models, Neurological , Neural Pathways/anatomy & histology , Principal Component Analysis
19.
PLoS One ; 13(9): e0203007, 2018.
Article in English | MEDLINE | ID: mdl-30204756

ABSTRACT

Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are of limited scientific interest in themselves; instead, scientists seek to discover biologically interpretable genomic regions composed of contiguous groups of probes, which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. We use ideas from fusion penalties and convex clustering to introduce a method for Spatial Convex Clustering, or SpaCC. Our method is specifically tailored to detecting multi-subject regions of methylation, but we also test our approach on the well-studied problem of detecting segments of copy number variation. We formulate our method as a convex optimization problem, develop a massively parallelizable algorithm to find its solution, and introduce automated approaches for handling missing values and determining tuning parameters. Through simulation studies based on real methylation and copy number variation data, we show that SpaCC exhibits significant performance gains relative to existing methods. Finally, we illustrate SpaCC's advantages as a pre-processing technique that reduces large-scale genomics data into a smaller number of genomic regions through several cancer epigenetics case studies on subtype discovery, network estimation, and epigenome-wide association.
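The probe-to-region mapping SpaCC aims for can be caricatured with a hard threshold in place of the convex fusion penalty: scan the probes in genomic order and start a new region wherever adjacent probe profiles differ too much across subjects. This hypothetical sketch shows only the goal, multi-subject contiguous regions, not SpaCC's convex optimization; names and the threshold are assumptions.

```python
def spacc_like_regions(probe_matrix, tau=0.3):
    """Caricature of spatial region detection: given a subjects x
    probes matrix with probes in genomic order, open a new region
    wherever adjacent probe profiles differ by more than tau on
    average across subjects.  SpaCC itself achieves this smoothly
    with a convex fusion penalty rather than a hard threshold."""
    n_sub = len(probe_matrix)
    n_probe = len(probe_matrix[0])
    regions, current = [], [0]
    for p in range(1, n_probe):
        diff = sum(abs(probe_matrix[s][p] - probe_matrix[s][p - 1])
                   for s in range(n_sub)) / n_sub
        if diff > tau:          # profiles jump: close the region
            regions.append(current)
            current = [p]
        else:                   # profiles similar: extend the region
            current.append(p)
    regions.append(current)
    return regions              # list of probe-index groups
```

Because the boundary decision pools evidence across all subjects, the recovered regions are common to the cohort, which is the key difference from per-subject segmentation.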


Subjects
Genomics/methods , Breast Neoplasms/genetics , Cluster Analysis , Computer Simulation , DNA Copy Number Variations , DNA Methylation , Female , Genome , Humans , Ovarian Neoplasms/genetics , Spatial Analysis , Unsupervised Machine Learning
20.
Bioinformatics ; 34(7): 1141-1147, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29617963

ABSTRACT

Motivation: Batch effects are one of the major sources of technical variation that affect the measurements in high-throughput studies such as RNA sequencing. It has been well established that batch effects can be caused by different experimental platforms, laboratory conditions, sample sources, and personnel. These differences can confound the outcomes of interest and lead to spurious results. A critical input for batch correction algorithms is the knowledge of batch factors, which in many cases are unknown or inaccurate. Hence, the primary motivation of our paper is to detect hidden batch factors that can be used in standard techniques to accurately capture the relationship between gene expression and other modeled variables of interest. Results: We introduce a new algorithm based on data-adaptive shrinkage and semi-Non-negative Matrix Factorization for the detection of unknown batch effects. We test our algorithm on three different datasets: (i) Sequencing Quality Control, (ii) Topotecan RNA-Seq and (iii) Single-cell RNA sequencing (scRNA-Seq) on Glioblastoma Multiforme. We demonstrate superior performance in identifying hidden batch effects compared with existing batch detection algorithms on all three datasets. In the Topotecan study, we identified a new batch factor that was missed by the original study, which had led to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrated the power of our method in detecting subtle batch effects. Availability and implementation: DASC R package is available via Bioconductor or at https://github.com/zhanglabNKU/DASC. Contact: zhanghan@nankai.edu.cn or zhandonl@bcm.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
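What "detecting a hidden batch factor" accomplishes can be shown with a much simpler stand-in than DASC's data-adaptive shrinkage plus semi-NMF: recover a two-batch split by thresholding the leading right singular vector of the centered expression matrix. This hypothetical sketch illustrates the task, not the published algorithm; the function name and toy data are assumptions.

```python
import numpy as np

def hidden_batch_guess(X):
    """X: genes x samples expression matrix.  Return a 0/1 batch
    guess per sample by splitting on the leading right singular
    vector of the gene-centered matrix.  A toy stand-in for DASC's
    shrinkage + semi-NMF batch detection."""
    Xc = X - X.mean(axis=1, keepdims=True)        # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    v = Vt[0]                                     # dominant sample pattern
    return (v > v.mean()).astype(int)             # two-group split

# Toy data: per-gene baselines plus a hidden +2.0 batch offset
# on the first three samples.
base = np.linspace(1.0, 5.0, 20).reshape(-1, 1)
X = np.tile(base, (1, 6))
X[:, :3] += 2.0
labels = hidden_batch_guess(X)
```

When the batch effect dominates the centered variation, the leading singular vector aligns with it, and the recovered labels can then feed a standard batch-correction routine as the otherwise-unknown batch factor.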


Subjects
Algorithms , Gene Expression Profiling/methods , Quality Control , Research Design , Sequence Analysis, RNA/methods , Glioblastoma/genetics , Humans , Topotecan/pharmacology